Image Captioning: CNN Encoding --> Attention --> LSTM Decoding

Designing an Image Captioning System Using an Attention Mechanism

For example:

Caption: A surfer riding on a wave.

Both of the following files are required for this project:

Flickr8k_Dataset

Flickr8k_text

How it will work:

To generate a caption at test time, we feed an image to the model along with a "start sequence" input, then repeatedly take the most likely predicted next word until an "end sequence" token is produced. The merge of image and text features can be either addition or concatenation.
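This greedy, word-by-word decoding loop can be sketched in plain Python. Here `predict_next` and `toy_predict` are hypothetical stand-ins for the trained decoder, which would score the full vocabulary at each step:

```python
# Minimal sketch of greedy decoding. `predict_next` is a stand-in for the
# trained decoder: given the image features and the words generated so far,
# it returns a probability distribution (word -> prob) over the next word.
def greedy_decode(image_features, predict_next, max_length=20,
                  start_token="startseq", end_token="endseq"):
    caption = [start_token]
    for _ in range(max_length):
        probs = predict_next(image_features, caption)
        next_word = max(probs, key=probs.get)  # most likely next word
        if next_word == end_token:
            break
        caption.append(next_word)
    return caption[1:]  # drop the start token

# Toy stub that always walks through one fixed caption.
def toy_predict(features, words):
    sequence = ["a", "surfer", "riding", "a", "wave", "endseq"]
    return {sequence[len(words) - 1]: 1.0}

print(greedy_decode(None, toy_predict))
# → ['a', 'surfer', 'riding', 'a', 'wave']
```

The real model would replace `toy_predict` with a forward pass through the decoder; the loop structure stays the same.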

Mounting Google Drive locally

Loading all the dependencies

Utility Functions to load and clean data

Exploratory Data Analysis (EDA)

Making a dataframe out of raw text

Some data cleaning:

Let's explore the dataframe

Let's plot images along with their captions

[3] Preprocessing

[3.2] Captions

Cleaning the captions
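A typical cleaning pass might look like the sketch below. The exact steps chosen here (lowercasing, stripping punctuation, dropping one-letter and non-alphabetic words, wrapping with `startseq`/`endseq` tokens) are common conventions, not necessarily the notebook's exact pipeline:

```python
import string

def clean_caption(caption):
    # Lowercase, strip punctuation, drop one-letter words and words
    # containing non-alphabetic characters, then wrap with start/end tokens.
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"

print(clean_caption("A surfer riding on a wave."))
# → 'startseq surfer riding on wave endseq'
```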

Preprocessing the images

Let's view a few image examples

Taking only 40,000 images and captions

Save variables in a pickle file and restore them for later reuse (similar to the process used in the baseline)

Converting every image to shape (224, 224, 3), as required by VGG-16

Creating functions and trying Image Augmentation

Resizing of images
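As a minimal illustration of resizing to the (224, 224, 3) shape VGG-16 expects, here is a nearest-neighbour sketch in NumPy. A real pipeline would use a proper library resampler (e.g. PIL or `tf.image.resize`) rather than this:

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    # Nearest-neighbour resize: pick, for each output pixel, the closest
    # source pixel by integer scaling of the row/column indices.
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

img = np.zeros((640, 480, 3), dtype=np.uint8)  # dummy image
print(resize_nearest(img, 224, 224).shape)  # → (224, 224, 3)
```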

Image Histograms

Pre-Trained Image Model (VGG16)

The following creates an instance of the VGG16 model using the Keras API. This automatically downloads the required files if you don't have them already.

The VGG16 model was pre-trained on the ImageNet dataset for image classification. It consists of a convolutional part and a fully-connected (dense) part that performs the final classification.

If include_top=True, the whole VGG16 model is downloaded, which is about 528 MB. If include_top=False, only the convolutional part is downloaded, which is just 57 MB.

Feature extraction from VGG16

Map each image name to the image-loading function

After this step all the images have been resized to (224,224,3)

Data Preparation for the Language Generation (RNN) DECODER

Tokenizing the captions and creating vocabulary

Maximum and minimum caption lengths

Pad each vector to the max_length of the captions, so that all caption vectors are of the same length
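The tokenize-and-pad step can be sketched in plain Python (Keras's `Tokenizer` and `pad_sequences` do the same job; the integer ids below are purely illustrative):

```python
def build_vocab(captions):
    # Map each word to an integer id; 0 is reserved for padding.
    words = sorted({w for c in captions for w in c.split()})
    return {w: i + 1 for i, w in enumerate(words)}

def encode_and_pad(captions, vocab, max_length):
    # Convert captions to id sequences and right-pad with zeros
    # so every vector has length max_length.
    seqs = [[vocab[w] for w in c.split()] for c in captions]
    return [s + [0] * (max_length - len(s)) for s in seqs]

captions = ["startseq a surfer riding endseq", "startseq a wave endseq"]
vocab = build_vocab(captions)
max_length = max(len(c.split()) for c in captions)
print(encode_and_pad(captions, vocab, max_length))
# → [[4, 1, 5, 3, 2], [4, 1, 6, 2, 0]]
```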

Train-Test Split (80:20)

Creating dataset

Loading the .npy files

CNN ENCODER:

Here you can choose between Local Attention and Global Attention for the decoder
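As a sketch of what global (additive, Bahdanau-style) attention computes, here it is in NumPy: the decoder state scores every spatial location of the encoder features, and a softmax over those scores weights the context vector. All weight matrices below are random placeholders, not trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(features, hidden, W1, W2, v):
    # features: (num_locations, feat_dim) CNN encoder outputs
    # hidden:   (hidden_dim,) current decoder state
    # Additive (Bahdanau-style) scoring over every location.
    scores = np.tanh(features @ W1 + hidden @ W2) @ v  # (num_locations,)
    weights = softmax(scores)                          # attention weights
    context = weights @ features                       # (feat_dim,) context
    return context, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 256))     # e.g. an 8x8 grid of features
hidden = rng.normal(size=(512,))
W1 = rng.normal(size=(256, 128)) * 0.1    # placeholder parameters
W2 = rng.normal(size=(512, 128)) * 0.1
v = rng.normal(size=(128,))

context, weights = global_attention(features, hidden, W1, W2, v)
print(context.shape, round(weights.sum(), 6))  # (256,), weights sum to 1
```

Local attention differs in that it restricts the softmax to a window of locations around a predicted alignment position instead of attending over all of them.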

RNN DECODER and LOCAL ATTENTION MECHANISM:

RNN DECODER AND GLOBAL ATTENTION

Saving Checkpoint

Training Step:

Set up the TensorBoard summary writer

Training

Launching TensorBoard

Plot the Train and Validation Losses to check for overfitting

Evaluating the Captioning Model:

Given below are two methods to evaluate the captions

  1. Greedy Approach
  2. Beam Search

Choose either evaluate function, as per your need
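A minimal beam-search sketch, with `toy_next` as a hypothetical stand-in for the trained decoder: instead of committing to the single most likely word at each step, the search keeps the `b` highest-scoring partial captions and extends each of them:

```python
import math

def beam_search(image_features, predict_next, b=3, max_length=20,
                start_token="startseq", end_token="endseq"):
    # Each beam is a (log-probability, word list) pair. `predict_next`
    # stands in for the trained decoder and returns a word -> prob dict.
    beams = [(0.0, [start_token])]
    completed = []
    for _ in range(max_length):
        candidates = []
        for logp, words in beams:
            for word, p in predict_next(image_features, words).items():
                if word == end_token:
                    completed.append((logp + math.log(p), words))
                else:
                    candidates.append((logp + math.log(p), words + [word]))
        if not candidates:
            break
        beams = sorted(candidates, reverse=True)[:b]  # keep the b best
    best = max(completed) if completed else max(beams)
    return best[1][1:]  # best caption without the start token

# Toy next-word distributions keyed on the last generated word.
def toy_next(features, words):
    table = {
        "startseq": {"a": 0.6, "the": 0.4},
        "a":        {"surfer": 0.7, "wave": 0.3},
        "the":      {"surfer": 0.5, "wave": 0.5},
        "surfer":   {"endseq": 1.0},
        "wave":     {"endseq": 1.0},
    }
    return table[words[-1]]

print(beam_search(None, toy_next, b=3))  # → ['a', 'surfer']
```

With `b=1` this reduces to the greedy approach; larger beams (b=3, 7, 10, as tried below) trade compute for a broader search over candidate captions.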

Greedy Approach

Creating a helper function to visualise the attention weights used to predict each word.

Beam Search (b=3)

Beam Search (b=7)

Beam Search (b=10)

Beam Search (b=3)

Beam Search (b=7)

Beam Search (b=10)

Try captions on Validation Set

Beam Search (b=3)

Beam Search (b=7)

Beam Search (b=10)

Beam Search (b=3)

Beam Search (b=7)

Beam Search (b=10)

Beam Search (b=3)

Beam Search (b=7)

Beam Search (b=10)